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The objective of many high-dimensional microarray and RNA-seq 
studies is to develop a classifier of cancer patients based on charac¬ 
teristics of their disease. The germinal center B-cell (GCB) classifier 
study in lymphoma and the National Cancer Institute’s Director’s 
Challenge lung (DC-lung) study are two examples. In recent years, 
such classifiers are often developed using regularized regression, such 
as the lasso. A critical question is whether a better classifier can be 
developed from a larger training set size and, if so, how large the 
training set should be. This paper examines these two questions us¬ 
ing an existing sample size method and a novel sample size method 
developed here specifically for lasso logistic regression. Both meth¬ 
ods are based on pilot data. We reexamine the lymphoma and lung 
cancer data sets to evaluate the sample sizes, and use resampling 
to assess the estimation methods. We also study application to an 
RNA-seq data set. We find that it is feasible to estimate sample size 
for regularized logistic regression if an adequate pilot data set exists. 

The CCB and the DC-lung data sets appear adequate, under specific 
assumptions. Existing human RNA-seq data sets are by and large in¬ 
adequate, and cannot be used as pilot data. Pilot RNA-seq data can 
be simulated, and the methods in this paper can be used for sample 
size estimation. A MATLAB program is made available. 

1. Introduction. Regularized regression methods, such as the lasso, are 
common in the analysis of high-dimensional data [Bi et al. (2014), Zwiener 
et al. (2014), Moehler et al. (2013), Zhang et al. (2013)]. Regularized logistic 
regression is often used to classify patients into different groups, such as those 
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who will versus will not respond to a targeted therapy. While development 
of classifiers is a long process [Dyrskjpt (2003), Pfeffer et al. (2009), Hanash, 
Baik and Kallioniemi (2011), McShane and Hayes (2012), Simon (2010)], 
a critical step in that process is determining the sample size necessary to 
adequately train a classiher from high-dimensional data. 

In this paper, we look at three microarray data sets to evaluate whether 
the sample sizes used were adequate. Also, we examine RNA-seq data and 
the potential to determine sample size for this newer technology. We reex¬ 
amine the data of Rosenwald et al. (2002). Under the assumption that the 
germinal center B cell (GCB) lymphoma subtype patients are correctly iden¬ 
tified in this study, we examine whether a better classifier can be developed 
using the lasso logistic regression and, if so, how large a training set would 
be needed to do so. Similarly, for the lung cancer data set of Shedden et 
al. (2008), the lasso logistic regression is used to evaluate whether a better 
outcome predictor could be developed from a larger data set. Third, we ex¬ 
amine a more dramatic classification difference comparing prostate tumors 
to normal prostate tissue [Dettling and Biihlmann (2003)]. We also pro¬ 
vide the assumptions on which these sample size estimates are based. These 
assumptions can be used to evaluate important public health planning ques¬ 
tions, such as whether re-running similar studies using RNA-seq technology 
is likely to yield improved classification. Finally, we examine an RNA-seq 
data set as a proof of principle to assess the potential of these methods on 
this new technology. We find that the methods can be effectively applied to 
RNA-seq data. 

What follows is a brief review of the methodology literature. A more 
extensive review for interested readers appears in the Supplement [Safo, 
Song and Dobbin (2015)]. 

The novel approach developed for lasso logistic regression in this paper 
uses errors-in-variables (EIV) regression. EIV methods for logistic regression 
include simulation extrapolation [SIMEX, Cook and Stefanski (1994)], con¬ 
ditional score [Stefanski and Carroll (1987)], consistent functional methods 
[Huang and Wang (2001)], approximate corrected score [Novick and Stefan¬ 
ski (2002)], projected likelihood ratio [Hanfelt and Liang (1995)] and quasi¬ 
likelihood [Hanfelt and Liang (1995)]. We discuss each approach briefly. The 
SIMEX EIV method adds additional measurement error to the data, estab¬ 
lishes the trend, and extrapolates back to the no error model using a fitted 
polynomial regression; as discussed in Cook and Stefanski (1994), evaluating 
the adequacy of a fitted regression requires judgment and is not automatic. 
The subjective fitting step can complicate algorithm implementation and 
Monte Carlo evaluation of performance. So we do not focus on SIMEX, al¬ 
though we do find SIMEX useful in settings where other EIV methods do not 
perform well. The consistent functional method is most valuable in large- 
scale studies [Huang and Wang (2000)], which are currently very rare in high 
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dimensions with sequencing or microarrays. The logistic model does not ht 
the corrected score smoothness assumptions; also, the Monte Carlo corrected 
score method is not consistent for logistic regression and implementation of 
the method requires programming software with complex number capabili¬ 
ties [Carroll et al. (2006)]. Quasi-likelihood methods can be challenging to 
implement for logistic regression. Conditional score methods, on the other 
hand, are computationally tractable and relatively easy to implement, and 
have shown good hnite-sample performance [Carroll et al. (2006)]. We found 
the sufficient statistics suggested by Hanfelt and Liang (1995, 1997) to be 
more stable for this application than the original conditional score, so we 
used this closely related approach. 

A practical question when using a sample size method that will be based 
on a pilot data set, rather than a parametric model, is whether the pilot 
data set is large enough. If the pilot data set is too small, then no classifier 
developed on it may be statistically significantly better than chance, which 
can be assessed with a permutation test [e.g., Mukherjee et al. (2003)]. But, 
even if the classifier developed on the pilot data set is better than chance, the 
pilot data set can still be too small to estimate the asymptotic performance 
as n —)■ oo well. This latter is a more complex question. But because it is 
practically important, guidelines are developed here for evaluating the pilot 
data set size. 

The sample size method developed in this paper and the one in Mukherjee 
et al. (2003) are based on resampling from a pilot data set or from a sim¬ 
ulated data set if no pilot is available. Resampling is used to estimate the 
logistic regression slopes for different sample sizes and the prediction error 
variances. Cross-validation (CV) [e.g., Geisser (1993)] is a well-established 
method for obtaining nearly unbiased estimates of logistic regression slopes. 
Because regularized regression already contains a cross-validation step for 
parameter tuning, estimating the logistic regression slope by cross-validation 
requires nested (double) cross-validation [e.g., Davison and Hinckley (1997)]. 
An inner cross-validation loop selects the penalty parameter value, which is 
then used in the outer loop to obtain the cross-validated classihcation scores. 
We also found it necessary to center and rescale individual CV batches, 
and repeat the CV 20-50 times to denoise the estimates. This process is 
termed repeated, centered, scaled cross-validation (RCS-CV). To estimate 
prediction error variances, the leave-one-out bootstrap (LOOBS) [Efron and 
Tibshirani (1997)] can be used. Modihcation of standard LOOBS is needed 
because of the cross-validation step embedded in the regularized regression. 
To avoid information leak, the prediction error variance is estimated by 
the leave-one-out nested case-cross-validated (LOO-NCCV-BS) bootstrap 
[Varma and Simon (2006)]. The same centering and scaling steps added for 
CV were also added to the LOO-NCCV-BS. We call this CS-LOO-NCCV- 
BS. 
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Regularized regression for high-dimensional data is a very active area of 
current research in statistics. Common methods include the lasso [Tibshi- 
rani (1996)], the adaptive lasso [Zou (2006)] and the elastic net [Zou and 
Hastie (2005)], among many others [Fan and Li (2001), Meier, van de Geer 
and Biihlmann (2008), Zhu and Hastie (2004)]. In this paper, the focus 
of the simulation studies is on the lasso logistic regression, with selection 
of the penalty parameter via the cross-validated error rate. Our sample size 
methodology can be used with other regularized logistic regression methods, 
but may require modifications, particularly if additional layers of resampling 
are involved (e.g., the adaptive lasso). 

In the study of lymphoma, we find that there is little room for improve¬ 
ment in the GCB signature. This means that patients who receive treatment 
based on this signature would be unlikely to have their treatment changed 
as a result of a much larger study using microarrays being conducted. Simi¬ 
larly, in the lung cancer application, we find that larger studies in lung cancer 
would not be likely to yield better survival prediction, despite the relatively 
poor performance of the classifier. This confirms that, unlike breast cancer, 
developing clinically useful gene expression-based lung cancer prognostic 
predictors is probably not feasible. In the prostate cancer data set, we find 
that the asymptotic accuracy of a classifier that distinguishes tumor from 
normal tissue is only around 88%. This accuracy may be viewed as lower 
than expected since tumor and normal tissues tend to be very different in 
most cancers; whether the low accuracy is due to the high heterogeneity of 
prostate tissue and prostate cancer tissue or to possible contamination of 
the normal samples with pre-cancerous or undiagnosed cancers, we could 
not assess. Finally, we find that these methods can be applied to RNA-seq 
data, although the novel method appears to give better estimates. But, un¬ 
fortunately, using our own criteria for the pilot sample size requirements, 
publicly available human RNA-seq data sets are inadequate. This informa¬ 
tion supports the notion that there is a critical need to make RNA-seq data 
more widely accessible to researchers so that they can plan their studies 
properly. We hope this information will move policy makers to place a pri¬ 
ority in finding solutions to the existing privacy concerns with these data 
sets. 

The paper is organized as follows: Section 2 presents the methodology. 
Section 3 presents the results of simulation studies. Section 4 presents the 
results of real data analysis and resampling studies. Section 5 presents dis¬ 
cussion and conclusions. 

2. Methods. 

2.1. The penalized logistic regression model. Each individual in a popu¬ 
lation V belongs to one of two classes, Cq and Ci. For individual i, let Yi = 0 
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if i € Co and = 1 if i G Ci. One wants to predict Yi based on observed 
high-dimensional data gi G 3?^ and clinical covariates 2 ^ G 3?'^. A widely used 
model for this setting is the linear logistic regression model, 

( 2 . 1 ) 'n-{gi,Zi) = P{Yi = l\gi,Zi) = {1 -LExp[-a - 5'Zi - 

where a G 3?^, 5 G 3?'^ and 7 G 3?^ are population parameters. 

The negative log-likelihood, given observed data {yi,Zi,gi) for i = 1,..., n, 
is 

n 

L(a, 5 , 7 ) = -'^{yi\n['K{gi,Zi)] -L (1 - yi)\ii[l -'n{gi,Zi)\}. 
i=l 

To estimate parameters and reduce the dimension of gi, a regularized re¬ 
gression is often fit. Coefficients are set to zero using the penalized negative 
log-likelihood function 

p 

(2.2) Tp0nalized(ck)7) —C(cr,(5, 7)T ^ ^ ^kf (Sfk\ 

k=l 

where Xk are penalty parameters and / is a loss function. If /{'jk) = | 7 fc| and 
Afc = A > 0, then the result is lasso logistic regression [Tibshirani (1996)]. 
The first step of the lasso is to estimate the penalty parameter A, which is 
typically done by cross-validation. The clinical covariates zt are not part of 
the feature selection process in equation ( 2 . 2 ), but they can be added to that 
process if desired. The regularized regression estimates are the solutions to 

(2.3) {a, 5, 7 ) = min Lpenaiized (a, S, 7 ). 

Q:,(5,7 

The minimum can be found by the coordinate descent algorithm [Friedman 
et al. (2008)]. 

2.2. Predicted classification scores. Consider a training set and indepen¬ 
dent validation set. The training set is 

Tj = {{yi,zi,gi),...,{yn,Zn,gn)} 

and the validation set is 

Vk = {{Yfi,zl,g\),...,{Yfi^,zl„gl,)}. 

The minimization in equation (2.3) based on the data set Tj produces esti¬ 
mates {aj,6j,^j). The model is applied to the validation set I 4 , resulting in 
estimated scores 


f '' , r/ V I f 

{aj + 6jZi + 
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Let be the high-dimensional part of the predicted classification 

score for individual i in the validation set. That is, the g'^ is data from 
a high-dimensional technology, such as RNA-seq expression measurements. 
Let Xf = ygV and note that we can write -|- , where = 

( 7 j —'jYgi- (The u superscripts denote unstandardized variables, in contrast 
to standardized versions presented below.) The model of equation (2.1) can 
be written in the form 

P{YY = l\z,,gY) = {1 + Exp[-a - 5'- Xr]}"'. 


Note that, unlike the standard logistic regression model, the variable Xf 
does not have a slope parameter multiple. We could develop the model in 
its present form, but it will simplify presentation if we make it look more 
like the standard model. 

Define g,x = ~ f I'dfid) as the mean of the 'y'gi taken across 

the target population V, where the high-dimensional vectors have density / 
with respect to a measure g,. Similarly, define = Var-p(A“). If these exist, 
then we can standardize the scores 


(2.4) = 








u, 


V 


Wij = 


WY,-gx 




= Xi + U, 


12 •> 


resulting in the EIV logistic regression model 

PiJY = l\zi,X,) = {1 + Exp[-a - 6'z^ -gx- 

= {1 -F Expf-a^, - S'zi - /3ooXi]}~^, 


where ax = a + gx, /3oo = Note that Ep[Xi\ = 0 and Var-p(A'i) = 1. This 
is the EIV model of Carroll et al. (2006). With these adjustments, we can 
apply EIV methods in a straightforward way. 

Suppose we repeatedly draw training sets Tt at random from the popula¬ 
tion V, resulting in Ti, T 2 ,_Each time we apply the developed predictor to 

the validation set I 4 • Each application produces an estimated covariate value 
vector Xt = Wt of length m and corresponding vector of error values Ut, 
where Ut = {Uu, • • •, UmtY = Wt - Xf Define En[Ut] = limi^^o, X Ut 

and XainiUt) = limto_5.oo ~ Xn[Ut]){Ut - En[Ut]Y, that is, these 

are the expectation and variance taken across training samples of size n in 
the population. The derivation of the conditional score method is based on 
an assumption that the Ut are independent and identically distributed Gaus¬ 
sian with En[Ut] = 0 and XainiUt) = where T^uu is a positive definite 
matrix. This assumption can be divided into three component parts: 


(1) The En[Uij\gi] = 0 for z = 1,...,m. Equivalently, En[Wij\gi] = Xt, so 
that the estimated values are unbiased estimates of the population values. 
Intuitively, if ntrain is large enough to develop a good classifier, then this 
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assumption should be approximately true. However, if ntrain is much too 
small, then the estimated scores may be more or less random and not cen¬ 
tered at the true values—so that this assumption would be violated. But 
the assumption is required for identifiability [Dobbin and Song (2013)]. This 
shows that some model violation may be expected for our approach as the 
sample size gets small. 

(2) The Uij have finite variance. This would be true if < oo 

for each i. So, if the regularized linear predictor 7 j has finite second moments 
for training samples of size n, the condition would be satisfied. 

(3) The vector (Uij,... ,Umj) is multivariate normal. This means that 
given G ma t, = {gi,... ,gm), (7j — 7)^G’mat is multivariate normal. This would 
be true if were multivariate normal, and may be approximately true if 
conditions under which 7 j converges to a normal distribution are satisfied 
[e.g., Biihlmann and van de Geer (2011)]. 

To further simplify the model, we assume \ai{Uj) = (j'^Rn where Rn is a 
correlation matrix; in other words, we assume the prediction error variance 
is the same for each individual i. 

2.3. Defining the objective. Define (3j as the slope (associated with the 
Wij) from fitting a logistic regression of 1) on {zi,Wij) across the entire 
population V. In other words, fij is the true slope from a logistic regression 
that uses the training-set-derived Wij as predictors; note that there is one 
well-defined fij for a particular training set. The tolerance is then 

Tol(n) = |/3oo - ^n[/3i]|. 

Under regularity conditions the tolerance will be finite and |£^„[/3j]j < J/IcxdI, 
and lim„_^oo Tol(n) = 0 [Supplement, Section 5.1, Safo, Song and Dobbin 
(2015)]. Note that it is possible to have j£',i[/3j]j > fioo in logistic EIV [Ste- 
fanski and Carroll (1987)]. Let ftarget be the targeted tolerance. The targeted 
sample size ntarget is the solution to 

^target = min{n| Tol(n) < ttarget}- 


2.4. Estimation. Resampling is used to search for ntarget nonparametri- 
cally. This section outlines each step in the estimation process. More detailed 
descriptions appear in the Supplement [Safo, Song and Dobbin (2015)]. 

2.4.1. Estimation for the full pilot data set. Let npUot be the size of the 
pilot data set. The parameter = RnpHaAPj] defined in Section 2.3 can 

be estimated by cross-validation [e.g., Geisser (1993)]. Regularized logis¬ 
tic regression requires specification of a penalty parameter [A in equation 
(2.2)]. Selecting this penalty parameter once using the whole data set re¬ 
sults in biased estimates of predicted classification performance [Ambroise 
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and McLachlan (2002), Simon et al. (2003)]. Therefore, a nested (double) 
cross-validation is required [see, e.g., Davison and Hinkley (1997)]. An inner 
loop is used to select the penalty parameter A; then that penalty parameter is 
used in the outer loops to obtain the cross-validated classification scores. Be¬ 
cause the split of the data set into 5 subsets may impact the resulting nested 
CV slope estimate, we suggest the RCS-CV method; RCS-CV is defined as 
repeating the cross-validation 20-50 times, centering and scaling each cross- 
validated batch, and using the mean of these 20-50 cross-validated slopes as 
the estimate. Centering and scaling of the cross-validated batches is needed 
to reduce error variance due to instability in the lasso regression parameter 
estimates (not shown). We recommend 5-fold cross-validation. 

The cross-validated scores provide an estimate of the slope for a training 
sample of size Upiiot, which we can denote want to apply errors- 

in-variables regression to estimate the tolerance, Tol(npiiot), and for that we 
also need an estimate of the error variance, = Var^^.j^^ {Uij)- The leave- 

one-out bootstrap [e.g., Efron and Tibshirani (1997)] can be used to estimate 
'^^piiot • Because tuning parameters must be selected in regularized regression, 
a nested, case-cross-validated leave-one-out bootstrap (LOO-NCCV-BS) is 
required [see, e.g., Varma and Simon (2006)]. Letting Wij^bs represent these 
bootstrap scores for z = 1,..., npUot and j = 1,..., 6o) where bo is the number 
of bootstraps for each left-out case, then the estimate of is 


^pilot 6o 


( 7 , 


npiiotibo - 1 ) ^ 




2 


where As with the CV described in the previous 

paragraph, one needs to standardize the cross-validated bootstrap batches 
to have mean zero and variance 1. This is the CS-LOO-NCCV-BS proce¬ 
dure. Note that in practice the leave-one-out bootstrap is performed using 
a single bootstrap and collating the results appropriately, which reduces the 
computation cost [Davison and Hinckley (1997)]. 

Now the is “plugged in” to a univariate EIV logistic regression 

which also uses the nested CV predicted classihcation scores as the Wij 
in equation (2.4). The conditional score method of Stefanski and Carroll 
(1987), with the Hanfelt and Liang (1997), equation (3), modification, is 
used to estimate the asymptotic slope (3oo associated with the Aj. Briefly, if 
we write the logistic density of equation (2.3) in the canonical generalized 
linear model form 


\ T? ( yi{a + 6'Zi + PooXi) - b{a + 6'Zi +13, 

- 


'oo-^O 
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where the functions are a((/>) = l,b{x) = ln{x), = 0, then letting 9 = 

{a,S,l3ooy the conditional score function for 6 has the form 

/ {yi-E[yi\Aei]) \ 

I Zi{yi - E[yi\Aei]) | , 

* \xi{yi-E[yi\Agi]) / 

where Ag^ = Wij + yi'^Poo, Xi is an estimator of Xi based Ag^, and ^ = 
Varnpii„t(^h)a('^)- 

The conditional score method produces /3oo- The tolerance is then esti¬ 
mated with Tol(npiiot) = |/3oo - /^npUotl- 

2.4.2. Estimating tolerance for subsets of the pilot data set. Typically, 
Tol(npiiot) will be larger or smaller than ftarget, the targeted tolerance. In 
either case, more information about the relationship between Tol(n) and n is 
needed to estimate ntarget- Such information can be obtained by subsampling 
from the pilot data set. We suggest 7 subsets with a range of sizes be taken 
from the pilot data set. Each subset should be large enough, as dehned in 
Section 2.5 below. For example, Upiiot x k/1 for A: = 1,..., 7 can be used. More 
typically, if the pilot data set is not as large, then one may use (npiiot/2) + 
k/6* (npiiot/2) for /c = 0,..., 6. If npiiot/2 is not large enough, then the pilot 
set is probably inadequate. 

For each subset size less than Upiiot, call them n^,... ,ng, take a random 
sample from the full data set without replacement. Then apply the procedure 
described for the full pilot data set to each subset and obtain Tol(np, k = 
1,..., 6. The only modification we suggest to the original RCS-CV procedure 
is that for the 20-50 repetitions, take a different random sample each time. 

2.4.3. Estimation of htarget- Analysis of a pilot or simulated data set 

produces sample sizes n| < < • • • < ny and corresponding tolerance es¬ 

timates ti = Tol(ny),..., t 7 = To^ny). Fit the Box-Cox regression model 
(tf — 1)/K = 6o + 6in* + Si to obtain k, and define h{x) = (x'^ — l)/k, k 7 ^ 0 
and h{x) = Ln{x),k = 0. Then ht with least squares n* = t] + (h{ti), which 
produces 7 } and f. Finally, if itarget > 0 is the desired tolerance, the sample 
size is 

htarget — 7? T Ch(ttarget)' 

As discussed above, we recommend estimating each tolerance 20-50 times by 
repeated random sampling, say, tiq,... ,ti ,20 and estimating ^ 

Note that the Box-Cox method requires that only values of target > 0 be 
considered. Also, in our modeling context, target < 0 does not make intu¬ 
itive sense since the tolerance should always be positive. Theoretically, the 
sample size estimate should be oo when target = 0. But, depending on the 
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estimates fj, C, k, this may not be the case. Our advice is that the estimated 
sample sizes should increase as ttarget decreases over the range of sample 
sizes considered. 

2.5. Are there enough samples in the pilot data set? The nested resam¬ 
pling methods in our approach require there be adequate numbers in the sub¬ 
sets. If there are npHot in the pilot data set, then a bootstrap sample will con¬ 
tain on average 0.632 x npiiot unique samples. A 5-fold case cross-validation 
of the bootstrap sample will result in 0.2 x 0.632 x Upiiot = 0.13 x Upiiot in 
a validation set. Since the validation set scores will be normalized to have 
mean zero and variance 1, we recommend at least 80 samples in the training 
set to ensure at least 10 samples in these cross-validated sets. If the class 
prevalence is imbalanced, this number should be increased. In particular, we 
recommend the following: 

Condition 1. If npiiot is the size of the pilot data set, then UpHot x 
0.13 X TTiowest > 5, where vriowest is the proportion from the underrepresented 
class. 

The conditional score methods will not work well as the error variance 
gets large. Since the conditional score methods are repeated 20-50 times for 
each subset size in the RCS-CV procedure, the stability of these estimates 
can be evaluated. Therefore, the following guideline is advised: 

Practical guideline 1. If the conditional score errors-in-variables re¬ 
gression estimates display instability for any subsample size, use quadratic 
SIMEX errors-in-variables regression instead. An example of instability would 
be I mean(/3oo)/s.d.(/3oo)| <0.5, where the mean and standard deviation are 
taken across the 20-50 replicates. 

Resampling-based approaches to sample size estimation require that the 
relationship between the asymptotic model and the estimated model can be 
adequately estimated from the pilot data set. Trouble can arise if the learning 
pattern displayed on the pilot set changes dramatically for sample sizes 
larger than the pilot data set. For example, there may be no classification 
signal detectable with 3 samples per class, but one is detectable with 50 
samples per class. So a pilot data set of 6 would lead to the erroneous 
conclusion that the asymptotic error rate is 50%, and any resulting sample 
size estimates would likely be erroneous. Similarly, the learning process can 
be uneven, so that the asymptotic error rate estimate increases or decreases 
as the sample size increases. The latter can happen when some subset of 
the features have smaller effects than others and are only detected for larger 
sample sizes. To guard against this in simulations, at least, we found that 
the following guideline is useful: 
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Condition 2. The predictor needs to find the important features re¬ 
lated to the class distinction with power at least 85%. 

Our simulation-based software program checks the empirical power for 
this condition. In the context of resampling from real data, it is not clear 
how one could verify this assumption empirically. But it may be possible to 
evaluate the effect size associated with this power by a parametric bootstrap. 


2.6. Translating between logistic slope and misclassification accuracy. If 
there are no clinical covariates in the model, then the misclassification error 
rate for the asymptotic model is [e.g., Efron (1975)] 


P{Yi = 1 and a + < 0) -|- P{Yi = 0 and a -I- fioo^i > 0) 

/ -«//3oo gO+/3ooAi j-QO 2 

-To— ^fx(x)dx+ / --3— ^fx(x)dx, 

-oo l + I i + I 

where fx{x) is the marginal density of the asymptotic scores across the 
population V. By definition, these scores have mean zero and variance one. 
If we further assume the scores are Gaussian, then the misclassification rate 
can be estimated with 


"^0 r a+PooXi 


E 

2 = 1 




l{xi < -a/Poo) + 


mo 

E 

2 = 1 


1 


1 -L 


l{xi > -a/Poo), 


where 1(A) is the indicator function for event A and mo is a number of 
Monte Carlo simulations, and xi,... ,Xmo are drawn from the distribution 
Xi ~ Normal(0,1). If covariates are added to the model, then the conditional 
distribution of Xi\zi needs to be used for the Monte Carlo. If Xi is independent 
of Zi, then the Xi could be generated from a standard normal, and the Monte 
Carlo equations modified in the obvious way. The Supplement [Safo, Song 
and Dobbin (2015)] shows a graph of the no covariate case relationship. 


3. Results: Simulation studies. For the simulation studies, high-dimen¬ 
sional data were generated from multivariate normal distributions, both 
a single multivariate normal and a mixture multivariate normal with ho- 
moscedastic variance. Both multivariate normal settings performed simi¬ 
larly [see Supplement, Safo, Song and Dobbin (2015)], so we just present 
one in the paper. The covariance matrices were identity, compound sym¬ 
metric (CS) and autoregressive order 1 [AR(1)], as indicated. Class labels 
were generated from the linear logistic regression model of equation (2.1). 
Categorical clinical covariate data, when included in simulations, were gen¬ 
erated from a distribution with equal probability assigned to each of three 
categories, where categories are correlated with class labels. 
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Table 1 

Estimates of the asymptotic slope f5oo and corresponding accuracy accoo evaluated by 
simulations, npiiot is the number of samples in the pilot data set. The covariance 
structure “Cov” are as follows: ARl is block autoregressive order 1 in 3 blocks of size 3 
(9 informative features) with parameter 0.7; Men. is identity with 1 block of 1 (1 
informative feature). Total of p = 500 features; all noise features independent standard 
normal. Summary statistics based on 200 Monte Carlo. More results appear in the 
Supplement [Safo, Song and Dobbin (2015)] 


Logist. slope 


Class. Acc. 


’^pilot 

Cov 

/3oo 

mean 

3.CCoo 

acCoo mean 

mean 

mean ( 3 , 

300 

ARl 

2.0 

2.07 

0.778 

0.783 

0.43 

1.49 

400 

ARl 

2.0 

2.01 

0.778 

0.779 

0.35 

1.62 

300 

ARl 

3.0 

3.04 

0.836 

0.838 

0.32 

2.31 

400 

ARl 

3.0 

2.93 

0.836 

0.834 

0.27 

2.47 

300 

ARl 

4.0 

3.95 

0.871 

0.869 

0.28 

3.06 

400 

ARl 

4.0 

3.88 

0.871 

0.868 

0.23 

3.26 

300 

ARl 

5.0 

3.77 

0.894 

0.865 

0.25 

3.71 

400 

ARl 

5.0 

4.81 

0.894 

0.891 

0.21 

3.99 

300 

Iden. 

2.0 

2.05 

0.778 

0.781 

0.23 

1.87 

400 

Iden. 

2.0 

2.01 

0.778 

0.778 

0.19 

1.90 

300 

Iden. 

3.0 

3.02 

0.836 

0.836 

0.17 

2.85 

400 

Iden. 

3.0 

2.98 

0.836 

0.835 

0.14 

2.87 

300 

Iden. 

4.0 

3.97 

0.871 

0.870 

0.14 

3.72 

400 

Iden. 

4.0 

3.92 

0.871 

0.869 

0.12 

3.75 

300 

Iden. 

5.0 

4.94 

0.894 

0.893 

0.14 

4.50 

400 

Iden. 

5.0 

4.86 

0.894 

0.891 

0.12 

4.55 


The asymptotic slope parameter /3oo must be estimated. Table 1 presents 
a simulation to evaluate the bias and variance of the asymptotic slope pa¬ 
rameter estimate j3oo- Also presented are the corresponding estimates of 
asymptotic classification accuracy accoo- As can be seen from the table, this 
approach does well overall at estimating the asymptotic performance for 
these pilot data set sample sizes (300 and 400), asymptotic slopes (2,3,4,5), 
multivariate normal high-dimensional data, covariance matrix structures 
(Identity, CS [Supplement, Safo, Song and Dobbin (2015)] and ARl) and 
numbers of informative features (1 and 9). There is some small bias appar¬ 
ent as the slope becomes large (/loo = 5), probably reflecting the fact that 
large slopes are problematic for EIV logistic regression. 

The tolerance associated with the estimated sample size should be within 
the user-targeted tolerance. To test this, sample sizes were calculated by 
applying the method to simulated pilot data sets. Then, these sample size 
estimates were assessed by performing very large pure Monte Carlo studies. 
Table 2 presents sample size estimates from our method and sample statistics 
from the Monte Carlo (MC) simulations. The mean tolerances from the MC 
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Table 2 

Evaluation of the sample size estimates from AR{1) and identity covariances. The 
number in the pilot data set is 400. /3oo =4. Identity covariance had one informative 
feature, and AR{1) had nine informative features in a block structure of 3 blocks of size 3 
with correlation parameter 0.7. Estimates evaluated using 400 Monte Carlo simulations 
with the estimated sample size. The mean tolerance from the 400 simulations and the 
proportion of the 400 within the specified tolerance are given in the rightmost two 
columns. The dimension is p = 500 


Cov. 

^target 

ft 

Mean MC tol. 

% of MC within tol. 

ARl 

0.10 

1742 

0.09 

64% 

ARl 

0.20 

986 

0.19 

62% 

ARl 

0.30 

715 

0.27 

67% 

ARl 

0.40 

573 

0.34 

71% 

ARl 

0.50 

484 

0.43 

72% 

ARl 

0.60 

424 

0.49 

75% 

ARl 

0.70 

380 

0.57 

77% 

Identity 

0.10 

509 

0.09 

79% 

Identity 

0.20 

322 

0.10 

87% 

Identity 

0.30 

242 

0.15 

87% 

Identity 

0.40 

194 

0.16 

92% 

Identity 

0.50 

162 

0.21 

90% 

Identity 

0.60 

139 

0.25 

93% 

Identity 

0.70 

121 

0.31 

91% 


are all within the targeted tolerance, indicating that the estimated sample 
sizes do achieve the targeted tolerance. The method tends to produce larger 
sample size estimates than required with 62%-93% of the true tolerances 
within the target (rightmost column). Note that our method guarantees 
that the expected slope is within the tolerance, but not that the actual 
slope is within the tolerance; this latter would be a stronger requirement. 

Implementation of our approach in the presence of clinical covariates was 
evaluated. Table 3 shows results when a clinical covariate is included into the 
setting. In this case the clinical covariate is also associated with the class 
distinction; in particular, in equation (2.1), 6 = Ln(2) and Zi G { — 1,0,1}, 
with l/3rd probability assigned to each value. As can be seen by comparison 
with Table 2, the addition of the clinical covariate significantly increases the 
required sample sizes. For example, the estimated sample size for a tolerance 
of 0.20 increases 29%, from 322 to 416. This increase reflects correlation 
between the clinical covariate and the class labels. The pure Monte Carlo 
evaluations in Table 2 show that the method does still produce adequate 
sample size estimates in the presence of the clinical covariate. 

Figure 1 is a summarization of results from all the different simulation 
studies. Negative values on the y-axis mean the sample size was overesti¬ 
mated, and positive values mean the sample size was underestimated. As 
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Table 3 

Clinical covariate simulations. One clinical covariate with 3 levels which are associated 
with the class distinction. The identity covariance and an asymptotic true slope of 
j3oo = 4. The dimension is p = 500. See text for more information 


^target 

h 

Mean MC tol. 

% of MC within tol. 

0.1 

592 

0.07 

80% 

0.2 

416 

0.09 

88% 

0.3 

334 

0.12 

92% 

0.4 

284 

0.14 

91% 

0.5 

249 

0.15 

95% 

0.6 

223 

0.17 

92% 


can be seen in the figure, the sample size estimates are mostly adequate or 
conservative. When the estimated sample size required is smaller than the 
pilot data set (x-axis values are negative), the resulting tolerance estimates 
are adequate or conservative; intuitively, identifying a sample size smaller 
than the pilot data set should be relatively easy. When the estimated sam¬ 
ple size required is larger than the pilot data set, the method continues to 
perform well overall. The exceptions are in the cases of compound symmet- 


Summary of sample size simulations 








aV 






V 


,.4 


^ - ^- 








o CS, slope 3 
A AR1, slope 3 
+ Iden, Slope 3 
X Iden, Slope 3, Clin 
O Iden, slope 4, Clin 
V Iden, slope 4 
K) CS, slope 4 
* AR1, slope 4 


-1 


Log2[(Est. n)/{Pilot n)] 


Fig. 1. Summary of results of simulations. The x-axis is the base 2 logarithm of the ratio 
of the estimated training sample size required divided by the pilot training sample size used. 
The y-axis is the average tolerance estimated from pure Monte Carlo simulations minus 
the targeted tolerance. 
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ric and ARl covariance with a small slope of 3; in these cases, the y-values 
are positive, indicating anti-conservative sample size estimates. The prob¬ 
lem here seems to be the power to detect the features. For the compound 
symmetric simulations, the empirical bootstrap power was 7.67/9 = 85.2%, 
and for ARl simulations, the power was 84.7%. Both are near the cutoff of 
the 85% power criterion developed in Section 2.5 above. Still, overall, the 
method seems to perform well. 

3.1. LC and EIV performance in simulations. We evaluated both our 
resampling-based method and the resampling-based method of Mukherjee 
et al. (2003) using pure Monte Carlo estimation of the truth in simulations. 
We will denote their method by LC (for learning curve) and our method by 
EIV (for errors-in-variables). Figure 2 shows a comparison of the two meth¬ 
ods under a range of simulation settings. In these simulations, our method 
may have an advantage because the logistic regression model was used to 
generate the response data. Tolerances of 0.1 and 0.2 were considered since 
these are associated with larger training sample sizes than the pilot data set. 
Comparing the percentage error of the sample size estimate to an estimate 
based on pure Monte Carlo, one can see that the learning curve method has 
an error an order of magnitude or more larger than our method. The LC 
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Eig. 2. Sample size estimates for 6 simulated data sets with tolerance set to 0.1. Compar¬ 
ison of errors-in-variables, learning curve, Dobbin and Simon (2007) (with false positive 
rate set to at most 1) and the true values (from pure Monte Carlo). Similar results for 
tolerance = 0.2 appear in the Supplement [Safo, Song and Dobbin (2015)]. Y-axis is on 
the log scale. Size of pilot data set is 300. 
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method tends to consistently overestimate the sample size in these simula¬ 
tions. In sum, the EIV method estimates were closer to the true sample size 
values than the LC method estimates across all of these simulations. 

Finally, we applied the method of Dobbin and Simon (2007) to the same 
data sets. As background, this method assumes a prespecified significance 
level cutoff for features, hence, it is not well-suited to the lasso logistic 
regression. Some ad hoc procedure is needed to approximate the lasso. We 
examined an approach that (1) picks the optimal significance level cutoff 
(likely to be anti-conservatively biased because of data re-use), and one 
that (2) picks a significance level that controls the expected number of false 
positives to be at most 1 (in other words, p-value cutoff 1/n). The sample size 
estimates for approach (1) were 126, 404, 366, 116, 404 and 374, for data sets 
1-6, respectively. Perhaps not surprisingly, this approach underestimates 
the sample size requirement for lasso logistic regression. The sample size 
estimates for approach (2), using a grid of size 100, were 6700, 1700, 1600, 
8100, 1800 and 1900. The results are similar to those for the learning curve. 

4. Real data set applications. The methods are applied to four data 
sets. The purpose of the first three applications is two-fold. First, we deter¬ 
mine the adequacy of the sample sizes of these studies using both sample 
size methods. Second, since microarray data may violate the normality as¬ 
sumption of the simulations, we evaluate the resampling performance of the 
methods as a check on their performance with this nonnormal data. The 
purpose of the fourth application is primarily to evaluate the ability of the 
methods to estimate sample size on RNA-seq data. 

A resampling study can be used to compare estimates from a procedure 
to a resampling-based “truth.” Since adequate sample sizes are unknown 
on these data sets, it was not feasible to compare sample size estimates to 
any corresponding estimated true values. But we can compare the error rate 
estimated from a subset of the data set to an independent estimate based 
on cross-validation on the whole data set. 

4.1. The lymphoma data set. We applied the FIV and the LC method 
to the data set of Rosenwald et al. (2002). The classes were germinal center 
B-cell lymphoma versus all other types of lymphoma. We used both methods 
to evaluate whether the sample size used in this study was adequate, and we 
subsetted 5 “pilot data sets” of size 100 at random from this data set. For 
each of these “pilot data sets,” we estimated the performance when n = 240 
are in the training set. Then, we could compare the estimated performance to 
a “gold standard” resampling-based performance on the full data set. Results 
are shown in Table 4. The two methods agree well on this data set, and both 
indicate that on average the pilot data set size of n = 240 provides close to 
the optimal accuracy possible, within 0.02. Comparing the two approaches 
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Table 4 

Resampling studies. Data set is the data set used for resampling. Rep is the replication 
number of 5 independent random subsamples (without replacement) of size nPilot. nFull 
is the size of the full data set. Classes for the Shedden data set were Alive versus Dead. 
Classes for the Rosenwald data set were Germinal-Center B-Cell lymphoma type versus 
all others. err{nFull) is estimated from 200 (50 for Shedden) random cross-validation 
estimations on the full data set using different partitions eaeh time, and this serves as 
the gold standard error rate for nFull. err{nFull) is the estimated error rate for the full 
data set based on the LC method or EIV method. Similarly, efr{oo) is the asymptotic 
error rate based on the LC method or EIV method. The first column is the dataset, “R” 
for Rosenwald and “S” for Shedden. Eor the Shedden data set, we used conditional score 
EIV; for the Rosenwald data set, we used quadratic SIMEX EIV because the criterion for 
the conditional score was violated (Section 2.5) 


Data 

Rep nPilot nFull 

err(nFull) 

LC method 

EIV method 

nFull 

err % 

err(nFull) 

err(oo) 

err(nFull) 

err(oo) 

LC 

EIV 

R 

1 

100 

240 

0.1129 

0.0855 

0.0729 

0.1344 

0.1135 

-25% 

19% 

R 

2 

100 

240 

0.1129 

0.0611 

0.0435 

0.1078 

0.0933 

-46% 

-5% 

R 

3 

100 

240 

0.1129 

0.0298 

0.0089 

0.0771 

0.0691 

-74% 

-32% 

R 

4 

100 

240 

0.1129 

0.1443 

0.1270 

0.1396 

0.1379 

28% 

24% 

R 

5 

100 

240 

0.1129 

0.0682 

0.0480 

0.0864 

0.0783 

-40% 

-23% 

Mean 




0.1129 

0.0778 

0.0601 

0.1091 

0.0984 

-31% 

-3% 

S 

1 

200 

443 

0.4207 

0.4638 

0.4634 

0.4347 

0.4347 

10% 

3% 

S 

2 

200 

443 

0.4207 

0.4496 

0.4481 

0.4154 

0.4151 

7% 

-1% 

S 

3 

200 

443 

0.4207 

0.4300 

0.4258 

0.2778 

0.2778 

2% 

-34% 

S 

4 

200 

443 

0.4207 

0.4166 

0.4126 

0.3550 

0.3550 

-1% 

-16% 

S 

5 

200 

443 

0.4207 

0.4159 

0.4117 

0.2907 

0.2894 

-1% 

-31% 

Mean 




0.4207 

0.4352 

0.4323 

0.3548 

0.3544 

3% 

-16% 


in their ability to estimate the error rate for n = 240 based on a pilot data set 
of only n = 100, the EIV method has better mean performance in terms of 
estimating the full data set error than the LC method. Here, the differences 
are less dramatic than the sample size differences; this may be due to the 
sensitivity of sample size methods to relatively small changes in asymptotic 
error rates or to the underlying data distribution. Both methods show some 
variation in error rate estimates across the five subsets of size 100. 

4.2. The lung cancer data set. We next applied both methods to the 
lung cancer data set of Shedden et al. (2008), where the classes were based 
on survival status at last follow-up. This binary indicator approach was used 
for predictors developed in the original paper. In this case, the two methods 
produce similar results. Both methods indicate here that on average the 
asymptotic performance is extremely close to the full data set performance 
when n = 443, with a difference in error rate of less than 0.01. Comparing the 
two methods in terms of their ability to estimate the error rate for n = 443 
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based on a pilot data set of only n = 200, the LC method was slightly 
better on average than the EIV with conditional score based on percentage 
error (rightmost columns of table); but the conditional score criterion in 
Section 2.5 was exceeded on 3 of the 5 data sets, and if quadratic SIMEX 
is used, then the LC and EIV are almost identical [Supplement, Safo, Song 
and Dobbin (2015)]. This is a very noisy problem and classification accuracy 
based on a training set of all 443 samples is only estimated to be around 
56%-60%. 

In order to evaluate the performance of our approach using modifications 
of the lasso, we next applied our EIV method using the elastic net [Zou and 
Hastie (2005)] to the lung cancer data set. In 3 of the 5 resampled data 
sets of size 200, the results were almost identical to the lasso; in one case 
(data set 5) the ratio of the estimate to its standard deviation was below 
the 0.5 cutoff for all sample sizes less than 183 and, not surprisingly, our 
method did not work in that case; in another resampled data set (data set 
3), our method produced larger sample size estimates for the elastic net than 
the lasso. Details are shown in the Supplemental Material [Safo, Song and 
Dobbin (2015)]. 

4.3. Prostate cancer microarray data set. We next applied our method 
to the prostate cancer data set described in Dettling and Biihlmann (2003). 
This data set consisted of a total of 102 human samples, 52 from prostate 
tumors and 50 from nontumor prostate tissue samples. Intuitively, we may 
expect that classification of samples into tumor versus nontumor would be 
a relatively easy problem. For this data set, the EIV sample size estimates 
for tolerances of 0.10, 0.30, 0.50 were 164, 119 and 103, respectively. Inter¬ 
estingly, these results suggest that the sample size used (102) for the pilot 
study is inadequate for producing a classifier with a tolerance closer than 
0.50 to the optimal value. The cross-validated misclassihcation rates for the 
prostate cancer data set reported in the (2003) paper ranged from 4.9% 
to 13.7%. The average of 20 resamplings from our approach resulted in an 
estimated /3io2 = 3.55, corresponding to an error rate of 14%. In this applica¬ 
tion, the estimated Poo averaged over 20 resamplings was 4.16 with standard 
error 0.12; the 4.16 corresponds to an error rate of 12%. Note that here the 
relatively large tolerance of 0.5 is associated with a small increase in accu¬ 
racy because the asymptotic slope estimate is relatively large. In conclusion, 
our method produces results that are in accord with intuition in that the 
sample size used produces a classiher with accuracy close to the optimal for 
this easier classification problem. But we also observed large fluctuations 
in the /3oo estimate when the resampled data sets were between 51 and 93, 
which suggests that the asymptotic estimate of a 12% error rate is quite un¬ 
stable. We won’t speculate as to the cause of this instability, but intuitively 
one would expect that a smaller error rate would be possible. Importantly, 
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the instability of the asymptotic estimate does not seem to compromise the 
sample size estimate. 

4.4. The RNA-seq data set. We performed a proof of principle study to 
see if these methods could be applied effectively to RNA-seq data. First, 
note that RNA-seq data after being processed may be in the form of counts 
(e.g., from the Myrna algorithm), but are more often in the form of contin¬ 
uous values (e.g., normalized Myrna data, or FPKM fragments-per-kilobase 
of exon per million fragments mapped from Cufflinks or other software). 
Therefore, linear models with continuous high-dimensional predictors are 
reasonable to use for RNA-seq data. But it is important to check that the 
processed data appear reasonably Gaussian and, if not, to transform the 
data. 

We applied the LC and EIV methods to the Drosophila melanogaster 
data of Graveley et al. (2011). Processed data were downloaded from the 
ReCount database [Frazee, Langmead and Leek (2011)]. Variables with more 
than 50% missing data were removed. Remaining data were truncated below 
at 1.5 and log-transformed. Low variance features were filtered out, resulting 
in p = 500 dimensions. Since this was a highly controlled experiment with 
large biological differences between the fly states, some class distinctions 
resulted in separable classes. Logistic regression is not appropriate for per¬ 
fectly separated data. Samples were split into two classes: Class 1 consisted 
of all the embryos and some adult and white prepupae (WPP); Class 2 con¬ 
sisted of all the larvae and a mix of adults and WPP. The class sizes were 82 
and 65. A principal component plot is shown in the Supplement [Safo, Song 
and Dobbin (2015)]. The data set consisted of a total of Upiiot = 147 data 
points. Technical replicates in the data created a clustering pattern visible 
in principal components plots. This type of clustering is often observed in 
cancer patient data sets due to disease subgroupings. We did not attempt to 
adjust the analysis for the technical replicates. The resulting EIV method 
equation for the sample size was 

/t-0-3434 _iX 

h = 105.73 - 14.25 -—— . 

V (-0.3434) J 

Eor tolerances of 0.1, 0.05 and 0.02, sample size estimates were 156, 180 and 
223, respectively. The cross-validated accuracy, averaged over 10 replica¬ 
tions, was 91%. Based on the /3oo = 4.55, the optimal accuracy is 88.5%, and 
the full data set accuracy is 88%, corresponding to /3i47 = 4.42 = 4.55 — 
Tol(147). The conditional score was used for the EIV method. The LC 
method curve was err = 0.075 -|-1.252 x 7),-0-9i2232i _ asymptotic accuracy 
estimate is 92.5%, corresponding to /3oo = 7.2, and the estimated accuracy 
when n = 147 is 91.2%, corresponding to /3„ = 6.1. The LC sample size es¬ 
timates for tolerances of 0.10, 0.05 and 0.02 were 1,383, 8,361 and 13,342, 
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respectively. As with the simulation studies, the LC method estimates are 
much larger than the EIV estimates. 

The difference in performance of the two methods on the RNA-seq data 
is probably attributable to reduced noise on this data set, resulting in a set 
of genes with large differences between the classes relative to the noise level 
present. This may make the fly data set similar to some of the simulated 
data sets, where large differences were observed between LC and EIV per¬ 
formance. This different structure of the data compared to the microarray 
data sets may be due to the larger biological variability between the ffy 
states, or a reduction in noise variation due to the RNA-seq platform, or a 
combination of both. 

5. Discussion. In this paper we studied the problem of sample size esti¬ 
mation for regularized logistic regression classification in cancer. Two meth¬ 
ods of sample size estimation were studied in simulations and applications. 
The simulation results suggested that the EIV method works well and that 
the LC method sometimes works but is sometimes overly conservative. The 
methods were applied to a lymphoma data set, a lung cancer data set, a 
prostate cancer data set and an RNA-seq data set. The results in lym¬ 
phoma and lung cancer suggest that these studies had adequate sample 
sizes already, and that larger studies are unlikely to yield better classifiers. 
For the prostate cancer data set, the analysis revealed that the pilot data 
set size was inadequate, resulting in high variation in the predicted classifi¬ 
cation scores. The RNA-seq data analysis showed that the EIV method also 
appears to work well on this type of data, but a critical problem is the lack of 
publicly available and accessible RNA-seq data sets that could serve as pilot 
data sets. This observation highlights a critical existing log-jam in medical 
research, the problem of the lack of availability of RNA-seq data sets (or 
modified versions thereof) for study planning purposes. In the meantime, 
existing microarray data sets and/or simulated data sets must be used for 
RNA-seq study planning. 

A new sample size method for training regularized logistic regression- 
based classifiers was developed. The method exploits a structural similarity 
between logistic prediction and errors-in-variables regression models. The 
method was shown to perform well when an adequate pilot data set is avail¬ 
able. Methods for assessing the adequacy of a pilot data set were developed. 
If no adequate pilot data set is available, the method can be used with Monte 
Carlo samples from a parametric simulation. 

An important issue in using either the LC or EIV method is the fitting of 
the curve that produces the final sample size estimate. In the LC method, 
as described in Mukherjee et al. (2003), a constrained least squares opti¬ 
mization must be performed on a nonlinear regression model. Constrained 
optimization methods like the L-BFGS-B algorithm used in the application 
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of the Mukherjee et al. (2003) method in this paper may produce different 
solutions than standard, unconstrained least squares optimization methods 
such as Nelder-Mead. In contrast, the Box-Cox algorithm and linear regres¬ 
sion fitting used by our approach are more straightforward to implement. 
Because our method does not need to “extrapolate to inhnity” as the typical 
learning curve method requires, the regression model is chosen that fits the 
best in the vicinity of the data points. This simplifies the fitting procedure, 
albeit at the cost of the errors-in-variables regression step. For both meth¬ 
ods, it is advisable to look at the final plot of the htted line and the data 
points as a basic regression diagnostic. 

The reader may have noted that the variance parameter cr^ = Var„(C/jj) 
is estimated by bootstrapping the pilot data set. But the variance is defined 
as a variance across independent training sets of size n in the population. 
Since the bootstrap data sets will have overlap, obviously there is potential 
bias in the bootstrap estimation procedure. Whether the bootstrap could be 
modified to reduce this bias is a potential area for future work. 

If more than two classes are present in the data, then simple regularized 
logistic regression is no longer an appropriate analysis strategy. In order to 
apply our method in that setting, regularized methods for more than two 
classes would need to be developed, for example, regularized multinomial or 
ordinal logistic regression methods. Also, corresponding errors-in-variables 
methods for these multi-class logistic regression methods would be needed. 
It appears that both these would be prerequisites to such an extension. 

If classes are completely separable in the high-dimensional space, then 
regularized logistic regression is not advisable because the logistic regression 
slope will be undefined and the logistic htting algorithms will become unsta¬ 
ble. The approach presented in this paper cannot be used in that context. 

In this paper we have focused simulations on settings with equal preva¬ 
lence from each class. If the class prevalences are unequal, then the method 
can still be applied as presented in the paper—as was done in the applica¬ 
tions to the real data sets, for example. However, if the imbalance is large 
(e.g., 90% versus 10%), then the training set size required by our Condition 1 
in Section 2.5 would likely be excessive. But, in this case, it is also less likely 
that accuracy will be the objective criteria for model selection because a 
high accuracy may be associated with a classifier that puts most subjects in 
the majority class. Ideally, positive and negative predictive values and their 
associated clinical implications would likely be more useful criteria. This is 
a potential future direction of research. 

SUPPLEMENTARY MATERIAL 

Supplemental tables, figures, algorithms, details and discussion (DOI: 
10.1214/15-AOAS825SUPP; .pdf). Supplemental material for paper by Safo, 
Song and Dobbin. 
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